Table of Contents
1 Read the .fits raw data
2 Preprocessing data
2.1 Time sample analysis
2.2 Resample the time series data
2.3 Create eclipse labels
Read the .fits raw data
To read the .fits data set, one can use the utils library. For each file it returns an HDU (Header Data Unit) list with 3 items, and only the last one (index = 2) holds the information needed for the experimental analysis.
[1]:
import utils
folder_list = [ './database/raw_fits/confirmed_targets',
                './database/raw_fits/red_giants',
                './database/raw_fits/bright_stars' ]
dread = utils.data_helper.reader()
curves = dread.from_folder(folder=folder_list[0], label='confirmed targets', index=2)
#curves += dread.from_folder(folder=folder_list[1], label='red giants', index=2)
#curves += dread.from_folder(folder=folder_list[2], label='bright stars', index=2)
INFO:reader_log:Reader module created...
INFO:reader_log:Reading 37 curve packages...
[2]:
import seaborn as sns
sns.set(context="paper",style="darkgrid",palette="rocket_r")
%matplotlib notebook
curves[0].show_feature(feature='WHITEFLUXSYS')
Preprocessing data
This is a short preprocessing step that adjusts some properties of the data before the machine learning analysis. The main idea is to put the time series data into a form that a simple machine learning analysis can use afterwards.
First, the data are resampled so that each time series has evenly spaced samples that can serve as features, since most of the time series are unevenly sampled.
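As an illustration of the resampling idea described above, here is a minimal sketch that linearly interpolates an unevenly sampled series onto a uniform grid. It is not necessarily the method used later in this notebook: the function name `resample_uniform`, the choice of linear interpolation, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def resample_uniform(t, y, step):
    """Linearly interpolate (t, y) onto a uniform grid with spacing `step`.

    Hypothetical helper for illustration; `t` must be increasing.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    # Uniform grid from the first to the last timestamp (inclusive when it fits)
    grid = np.arange(t[0], t[-1] + step / 2, step)
    return grid, np.interp(grid, t, y)

# Synthetic, unevenly spaced samples of a straight line
t = np.array([0.0, 1.0, 2.5, 3.0, 6.0])
y = 2.0 * t + 1.0

grid, y_uniform = resample_uniform(t, y, step=1.0)
```

Linear interpolation is only one option; for noisy light curves, binning or averaging within each grid cell may be a better fit.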
After that, labels are created to mark the eclipse regions along each time series, with True = 'eclipse' and False = 'not eclipse'. To achieve this, a filtering technique is first applied to remove the noise, since the data is highly contaminated. With the noise-free time series, it is possible to compute the discrete derivative and look for regions of high variation. Those regions correspond to the large change in light during an eclipse and can be used to map the periods in which eclipses occur.
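The filter-then-differentiate idea above can be sketched as follows. This is a rough sketch under stated assumptions, not the notebook's actual filter: a moving average stands in for the filtering technique, the threshold is a multiple of the derivative's standard deviation, and the name `high_variation_mask`, the window size, and the synthetic box-shaped dip are all hypothetical.

```python
import numpy as np

def high_variation_mask(flux, window=5, threshold=3.0):
    """Flag samples where the smoothed flux changes quickly.

    Hypothetical helper: smooths with a moving average (assumed filter),
    takes the discrete derivative, and thresholds its magnitude.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    flux = np.asarray(flux, dtype=float)
    # Pad with edge values so the smoothed series keeps the input length
    padded = np.pad(flux, window // 2, mode="edge")
    smooth = np.convolve(padded, np.ones(window) / window, mode="valid")
    # Discrete derivative of the smoothed series
    deriv = np.gradient(smooth)
    return np.abs(deriv) > threshold * np.std(deriv)

# Synthetic light curve: flat flux with a 10% dip plus small noise
rng = np.random.default_rng(0)
flux = np.ones(200)
flux[80:120] -= 0.1
flux += rng.normal(0.0, 0.002, size=flux.size)

mask = high_variation_mask(flux)
```

Note that the mask marks only the ingress and egress of the dip, where the flux changes fastest. Turning it into full True/False eclipse labels would require an extra step, such as filling the span between paired transitions.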
Time sample analysis
To analyse the sampling interval of the series, one can draw a box plot for each time-series curve. Each box plot represents the distribution of the difference t[k] - t[k-1], where k indexes the samples of the time series.
[3]:
import plotly.graph_objects as go
import pandas as pd
import numpy as np
from datetime import datetime
dt, dt_max, dt_min, dt_mean = [], [], [], []
for curve in curves:
    # sampling intervals t[k] - t[k-1], converted to minutes
    diff_time = np.diff(curve["DATE"])
    dt.append([x.total_seconds() / 60 for x in diff_time])
    dt_mean.append(np.mean(dt[-1]))
    dt_max.append(np.percentile(dt[-1], 75))  # upper quartile
    dt_min.append(np.percentile(dt[-1], 25))  # lower quartile
# y-axis limits around the mean interval, used below to hide the outliers
dt_min = np.mean(dt_mean) - 1.05 * ( np.mean(dt_mean) - np.mean(dt_min) )
dt_max = np.mean(dt_mean) + 0.01 * ( np.mean(dt_mean) - np.mean(dt_max) )
fig = go.Figure(data=[go.Box(y = value) for value in dt])
# format the layout
fig.update_layout(
    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
    yaxis=dict(zeroline=False, gridcolor='white'),
    paper_bgcolor='rgb(233,233,233)',
    plot_bgcolor='rgb(233,233,233)' )
#fig.update_yaxes(range=[np.mean(dt_min), np.mean(dt_max)])
fig.update_yaxes(range=[dt_min, dt_max]) # <- Comment this line to see the outliers!!
fig.show()
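The cell above depends on the `curves` objects loaded earlier. For readers without the data set, here is a small self-contained sketch of the same interval computation on synthetic timestamps. The helper name `sampling_intervals_minutes` and the timestamps are illustrative assumptions and are not part of the utils library.

```python
from datetime import datetime, timedelta
import numpy as np

def sampling_intervals_minutes(dates):
    """Return t[k] - t[k-1] in minutes for a sequence of datetime objects."""
    # Object array so that np.diff yields datetime.timedelta values
    diffs = np.diff(np.asarray(dates, dtype=object))
    return np.array([d.total_seconds() / 60 for d in diffs])

# Synthetic timestamps: mostly 8-minute sampling with one 16-minute gap
start = datetime(2020, 1, 1)
offsets = [0, 8, 16, 24, 40, 48]
dates = [start + timedelta(minutes=m) for m in offsets]

intervals = sampling_intervals_minutes(dates)
q25, q75 = np.percentile(intervals, [25, 75])
mean_interval = intervals.mean()
```

Here the single 16-minute gap pulls the mean above both quartiles. Outliers can shift the mean in the same way in the real curves, which is worth keeping in mind when choosing the y-axis range.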